Abstract

Tandem repeats (microsatellites or SSRs) are molecular markers with
great potential for plant genetic studies. Modern strategies include
the transfer of these markers between widely studied and orphan
species. In silico analyses make it possible to study the distribution
patterns of microsatellites and to predict which motifs would be more
amenable to interspecies transfer. Transcribed sequences (Unigene) from
ten species of three plant families were surveyed for the occurrence of
micro- and minisatellites. Transcripts from different species displayed
different rates of tandem repeat occurrence, ranging from 1.47% to
11.28%. Both similar and different patterns were found within and among
plant families. The results also indicate a lack of association between
genome size and tandem repeat fractions in expressed regions. The
conservation of motifs among species and its implications for genome
evolution and dynamics are discussed.

Introduction

Microsatellites or SSRs (simple sequence repeats) are DNA sequences
formed by the tandem repetition of motifs one to six base pairs long,
and are widely distributed in prokaryote and eukaryote genomes
(Morgante and Olivieri, 1993; Toth et al., 2000). Microsatellite
regions tend to form loops or hairpin structures, leading to slippage
of DNA polymerase during replication and thereby provoking the
insertion or deletion of nucleotides (Iyer et al., 2000). The expansion
and/or contraction of microsatellites may lead to a gain or loss of
gene function (Li et al., 2002, 2004a). Initially, it was suggested
that the occurrence and distribution of microsatellites could be the
result of random processes. However, newer evidence indicates that the
genomic distribution of these repeats originated in non-random
processes (Bell, 1996; Li et al., 2004b). Microsatellites have been
reported to correspond to 0.85% of the Arabidopsis (Arabidopsis
thaliana), 0.37% of the maize (Zea mays subsp. mays), 3.21% of the fugu
fish (Fugu rubripes), 0.21% of the nematode Caenorhabditis elegans and
0.30% of the yeast (Saccharomyces cerevisiae) genomes (Morgante et al.,
2002). Moreover, they constitute 3.00% of the human genome
(Subramanian et al., 2003).

For microsatellites located in genic regions, 5'UTRs are hotspots for
this type of repeat. The contraction and/or expansion of repeats found
in 5'UTR regions is known to alter the transcription and/or translation
of the corresponding genes (Li et al., 2004b; Zhang et al., 2006a).
Mutations in microsatellite loci found in 3'UTR regions are associated
with gene silencing, changes in transcript export to the cytosol and in
splicing, as well as altered expression levels of flanking genes (Davis
et al., 1997; Thornton et al., 1997; Philips et al., 1998; Conne et
al., 2000). For coding sequences (CDS), the impact of mutations has
been described as functional changes, loss of function and protein
truncation (Li et al., 2004b). Although much has been reported on
microsatellite frequencies in transcribed regions in plants (Temnykh et
al., 2001; McCouch et al., 2002; Morgante et al., 2002; Thiel et al.,
2003; Nicot et al., 2004; Kashi and King, 2006; Lawson and Zhang, 2006;
Varshney et al., 2006; Zhang et al., 2006b), additional comparative or
descriptive analyses can offer novel perspectives on their use as
molecular markers. The genomic abundance of microsatellites, and their
association with many phenotypes, make this class of molecular markers
a powerful tool for diverse applications in plant genetics.
Microsatellite markers derived from ESTs and/or cDNAs, described as
functional markers, are even more useful than those based on anonymous
regions (Varshney et al., 2005, 2006).

In order to provide information on the patterns of microsatellite
occurrence and distribution in transcribed genome regions,
non-redundant full-length cDNAs (fl-cDNAs) and/or ESTs belonging to ten
plant species from three different families (Brassicaceae, Solanaceae
and Poaceae) were analyzed.

Material and Methods 

Obtaining the expressed sequence 

Files containing expressed sequences were obtained for the following
families/species: Brassicaceae (Arabidopsis thaliana and Brassica
napus), Solanaceae (Solanum lycopersicum and Solanum tuberosum) and
Poaceae (Oryza sativa, Sorghum bicolor, Triticum aestivum, Zea mays,
Saccharum officinarum and Hordeum vulgare), all deposited in the
NCBI-Unigene database. Non-redundant yet representative sequences for
all known genes in each species were selected. The sequences used in
the present study were downloaded from the Unigene database in June
2008.

Distribution of sequences in different transcribed 
regions 

Using scripts written in Perl and the existing annotation for each
cDNA and/or EST sequence, the sequences were categorized as CDS,
upstream and downstream regions, partitioned into FASTA files and
labeled CDS, 5'UTR and 3'UTR for each species. Since intron annotation
was not part of the database, repeats present in intronic regions were
not considered in this study.
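
The partitioning step can be illustrated with a minimal Python sketch.
The original scripts were written in Perl and are not reproduced here;
this sketch assumes that each transcript comes with 1-based, inclusive
CDS start and end coordinates taken from its annotation. The function
name `split_regions` and the example sequence are hypothetical.

```python
def split_regions(seq, cds_start, cds_end):
    """Split a transcript into 5'UTR, CDS and 3'UTR.

    cds_start and cds_end are assumed to be 1-based, inclusive
    coordinates taken from the sequence annotation.
    """
    if not (1 <= cds_start <= cds_end <= len(seq)):
        raise ValueError("CDS coordinates fall outside the sequence")
    return {
        "5UTR": seq[:cds_start - 1],
        "CDS": seq[cds_start - 1:cds_end],
        "3UTR": seq[cds_end:],
    }


# Hypothetical transcript: 4 nt upstream, a 9 nt CDS, 5 nt downstream.
regions = split_regions("GGCAATGGCTTAAAGTTT", 5, 13)
```

A transcript annotated with only one UTR simply yields an empty string
for the missing region, which matches the observation that not every
database entry contains both UTRs.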

Location of tandem repeats 

SSRLocator software (Maia et al., 2008) was used to locate tandem
repeats. Software options were adjusted to locate monomers, dimers,
trimers, pentamers and hexamers containing a minimum of 10, 7, 5, 4 and
4 repeats, respectively. For mini-satellites, heptamers, octamers,
nonamers and decamers containing a minimum of 3, 3, 3 and 2 repeats,
respectively, were selected.
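
As an illustration of this kind of search, the following Python sketch
finds perfect tandem repeats with a regular expression. It is not
SSRLocator and does not reproduce its algorithm; the function name
`find_tandem_repeats` is hypothetical. The `MIN_REPEATS` table copies
the thresholds listed above for motif lengths 1-3 and 5-10, while the
value for tetramers is an assumption, since no tetramer threshold is
stated explicitly.

```python
import re

# Minimum repeat counts per motif length. Lengths 1-3 and 5-10 follow the
# thresholds listed in the text; the value for tetramers (4) is an assumption.
MIN_REPEATS = {1: 10, 2: 7, 3: 5, 4: 4, 5: 4, 6: 4, 7: 3, 8: 3, 9: 3, 10: 2}


def find_tandem_repeats(seq, min_repeats=MIN_REPEATS):
    """Return (start, motif, copies) for perfect tandem repeats in seq.

    Shorter motifs are searched first, and positions already covered by a
    reported locus are skipped, so (AG)8 is not also reported as (AGAG)4.
    Start positions are 0-based.
    """
    seq = seq.upper()
    covered = [False] * len(seq)
    loci = []
    for size in sorted(min_repeats):
        # A motif of `size` bases followed by at least (n - 1) more copies.
        pattern = re.compile(r"([ACGT]{%d})\1{%d,}" % (size, min_repeats[size] - 1))
        for match in pattern.finditer(seq):
            start, end = match.span()
            if any(covered[start:end]):
                continue
            covered[start:end] = [True] * (end - start)
            loci.append((start, match.group(1), (end - start) // size))
    return sorted(loci)


# Hypothetical sequence containing one (AG)8 and one (GAA)5 locus.
example = "TTGC" + "AG" * 8 + "CCATG" + "GAA" * 5 + "TC"
loci = find_tandem_repeats(example)
```

Searching shorter motifs first is one simple way to avoid counting the
same stretch twice under a longer, redundant motif; other tools may
resolve such overlaps differently.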

Results and Discussion 

Distribution of sequences in UTRs and CDSs 

The sequences, separated into coding (CDS) and untranslated (5'UTR and
3'UTR) regions, and distributed by number of sequences, amount (Mb) and
average size (bp) for all ten species, are shown in Table 1. On
average, the sequence fragments were between 560 and 893 bp long,
except for the A. thaliana and O. sativa databases, where they were
longer, reaching averages of 1,447 and 1,490 bp, respectively. The
number of sequences deposited in Unigene was largest for the Poaceae
species Z. mays and O. sativa, with 57,447 and 40,259, respectively. It
is worth noting that not all sequences deposited in this database
contain both 5'UTR and 3'UTR regions: some contain both, whereas others
contain only one (i.e., 5' or 3'UTR). The overall average sizes were
130 bp for 5'UTR, 873 bp for CDS and 270 bp for 3'UTR regions. On
average, 0.9% of the total nucleotides were allocated to 5'UTR, 97.5%
to CDS and 1.6% to 3'UTR regions. The only species with contrasting
values was Arabidopsis, where 6.8%, 82.6% and 10.7% of total
nucleotides were allocated to 5'UTR, CDS and 3'UTR regions,
respectively.

Percentage of expressed sequences with tandem 
repeats 

On average, 3.55% of the analyzed sequences contain one or more loci
with tandem repeats. The respective percentages for each species are
shown in Figure 1. The highest was for rice (11.28%), and the lowest
for the Solanaceae species S. lycopersicum and S. tuberosum, i.e.,
1.47% and 1.76%, respectively. The percentage found for Arabidopsis
(3.88%) agrees with other reports of between 3% and 5% (Cardle et al.,
2000; Kumpatla and Mukhopadhyay, 2005). For B. napus, S. lycopersicum
and S. tuberosum, 2.42%, 1.47% and 1.76% of the sequences contained
repeats, respectively, whereas different values (6.9%, 4.7% and 2.65%,
respectively) have been reported (Kumpatla and Mukhopadhyay, 2005). For
the Poaceae, a comparison of the present results with former reports
for H. vulgare (4.25% vs. 8.11%), Z. mays (2.14% vs. 1.5%), O. sativa
(11.28% vs. 4.7%), S. officinarum (2.13% vs. 2.9%) and T. aestivum
(2.38% vs. 7.5%) shows a different range of values (Cordeiro et al.,
2001; Kantety et al., 2002; Thiel et al., 2003; Nicot et al., 2004; Asp
et al., 2007). Nevertheless, all differences are within the 2-3 fold
range.

Figure 1 - Percentage of expressed sequences containing tandem repeat
loci.



The variation among reports is related to the search strategy employed
by the authors (software, repeat number and repeat type defined for the
search). However, it is generally agreed that microsatellite stretches
with a minimum size of 20 bp are present in approximately 2%-5% of
cereal EST sequences (Varshney et al., 2005).

Frequency of tandem repeats in UTR and CDS 
regions 

Results for total occurrence (total loci), percentage per region (the
number of loci per region divided by the total number of loci) and
frequency (number of loci per megabase) are shown separately for each
species and by genic region (5'UTR, CDS and 3'UTR) in Table 2. Across
all the surveyed species (10,731 loci), 4.92% (529 loci) and 2.21% (237
loci) of all repeats were found in the 5'UTR and 3'UTR regions, with
average frequencies of 1.3 and 0.7 loci/Mb, respectively. Coding
regions (CDS) showed a higher occurrence of micro- and minisatellites,
reaching 92.86% of the total loci found (9,965 occurrences), with an
average frequency of 35.1 loci/Mb. The higher percentage of repeats in
CDS regions is a consequence of the trimers present in these regions.
However, for Arabidopsis, high percentages of dimer (17.9%), trimer
(19.3%) and total (44.5%) microsatellites were found in UTR regions, in
contrast with the other species (Table 3). For the Rosaceae, between
44.3% and 53.2% of the microsatellites were found in UTR regions (Jung
et al., 2005). For Arabidopsis, 81% and 26% of dimers and trimers,
respectively, were found in UTR regions (Yu et al., 2004).

Table 2 - Overall distribution of tandem repeat occurrences in translated and non-translated transcripts.

                 5' UTR                     CDS                        3' UTR                     Total
Species          Occurrence  %      ssr/Mb  Occurrence  %      ssr/Mb  Occurrence  %      ssr/Mb  Occurrence  ssr/Mb
A. thaliana      395         34.0   9.1     610         52.5   14.1    157         13.5   3.6     1,162       27
B. napus         1           0.2    0.0     632         99.5   31.1    2           0.3    0.1     635         31
S. lycopersicum  6           2.4    0.4     234         94.0   16.8    9           3.6    0.6     249         18
S. tuberosum     4           1.2    0.3     336         97.7   21.6    4           1.2    0.3     344         22
O. sativa        78          1.7    1.3     4,433       97.6   73.9    29          0.6    0.5     4,540       76
S. bicolor       3           0.6    0.3     505         99.4   53.3    0           0.0    0.0     508         54
T. aestivum      11          1.3    0.4     795         97.0   30.4    14          1.7    0.5     820         31
Z. mays          12          1.0    0.4     1,205       98.0   37.4    13          1.1    0.4     1,230       38
S. officinarum   0           0.0    0.0     332         100.0  26.1    0           0.0    0.0     332         26
H. vulgare       19          2.1    1.0     883         96.9   46.2    9           1.0    0.5     911         48
Total / average  529         4.9    1.3     9,965       92.9   35.1    237         2.2    0.7     10,731      37
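
To make the quantities reported in Table 2 concrete, the short Python
sketch below recomputes the percentage per region for Arabidopsis from
its occurrence counts and shows how a loci/Mb frequency is obtained.
The function names are hypothetical, and the 43.0 Mb figure is a round
number back-calculated from the Table 2 totals purely for illustration;
it is not a value reported in Table 1.

```python
def percent_per_region(counts):
    """Percentage of all loci that fall in each region."""
    total = sum(counts.values())
    return {region: round(100 * n / total, 1) for region, n in counts.items()}


def loci_per_mb(n_loci, total_bp):
    """Frequency of loci per megabase of surveyed sequence."""
    return n_loci / (total_bp / 1_000_000)


# Occurrence counts for A. thaliana from Table 2.
arabidopsis = {"5UTR": 395, "CDS": 610, "3UTR": 157}
shares = percent_per_region(arabidopsis)

# Illustrative only: 43.0 Mb is an assumed amount of surveyed sequence.
freq = loci_per_mb(sum(arabidopsis.values()), 43_000_000)
```

The computed shares agree with the Arabidopsis row of Table 2 (34.0%,
52.5% and 13.5%).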

In the present study, a very high percentage of microsatellites in
5'UTRs was detected in Arabidopsis, with a frequency of 9.1 loci/Mb.
These repeats represented 34% of all the 1,162 loci found in the 29,918
sequences analyzed in this species. The second and third highest
frequencies of repeats in these regions were encountered in O. sativa
and H. vulgare, with averages of 1.3 and 1.0 loci/Mb, respectively
(Table 2).

Many studies indicate that UTR regions are more abundant in
microsatellites than CDS regions (Morgante et al., 2002). In the
present work, the assignment of 92.86% of microsatellite loci to CDS
regions is attributed to a deficiency in annotation when separating
translated from non-translated fractions in the Unigene transcript
database.

As observed for 5'UTRs, contrasting values were also found in 3'UTR
regions. Much higher values were encountered in Arabidopsis (an average
of 3.6 loci/Mb) than in the remaining species, all of which were at or
below 0.6 loci/Mb (Table 2).

Considering the overall occurrence in 5'UTRs, 3'UTRs and CDSs in all
species, the average frequency observed is 37 loci/Mb. Values range
from 18 loci/Mb in tomato to 76 loci/Mb in rice. Average frequency
values per family are 29.0 loci/Mb in the Brassicaceae, 19.9 in the
Solanaceae and 45.4 in the Poaceae (Table 2).
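
The per-family averages can be approximated from the rounded totals in
Table 2, as in the Python sketch below. The dictionary layout and names
are hypothetical; because the table values are rounded to whole
loci/Mb, the results may differ from the figures quoted above in the
first decimal place.

```python
from statistics import mean

# Total loci/Mb per species, as rounded in Table 2.
TOTAL_LOCI_PER_MB = {
    "Brassicaceae": {"A. thaliana": 27, "B. napus": 31},
    "Solanaceae": {"S. lycopersicum": 18, "S. tuberosum": 22},
    "Poaceae": {
        "O. sativa": 76, "S. bicolor": 54, "T. aestivum": 31,
        "Z. mays": 38, "S. officinarum": 26, "H. vulgare": 48,
    },
}

# Unweighted mean of the species values within each family.
family_means = {
    family: round(mean(list(species.values())), 1)
    for family, species in TOTAL_LOCI_PER_MB.items()
}
```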

Several reports have indicated values higher than those found in this
study, i.e., 112-133 loci/Mb in barley, 133 loci/Mb in maize,
94-161 loci/Mb in wheat, 158-169 loci/Mb in sorghum, 161 loci/Mb in
rye, 256-277 loci/Mb in rice and 133 loci/Mb in Arabidopsis (Varshney
et al., 2002; Thiel et al., 2003; Parida et al., 2006). In Citrus
species, values as high as 507 loci/Mb have been described in EST
sequences (Palmieri et al., 2007). Values as high as 125 loci/Mb were
also noted in Brassica rapa (Hong et al., 2007). Frequency values
closer to those of our study have been reported for the CDS regions in
Rosa chinensis (rose), Prunus dulcis (almond), Prunus persica (peach)
and Arabidopsis, with values ranging from 39 to 78 loci/Mb (Jung et
al., 2005).

Percentage occurrence of different microsatellite 
types in the UTR and CDS regions 

The detailed percentage values for each repeat type in the different
parts of a genic region are listed for each species in Table 3. The
average occurrence of dimer microsatellites across all species was
21.9%, the majority of these loci being present in CDS regions. The
average percentage of dimer occurrence for each family was 31.5% in
the Brassicaceae, 21.7% in the Solanaceae and 18.8% in the Poaceae. The
percentage of dimer microsatellites in CDS regions ranged from 4.0% in
Arabidopsis to 40.8% in B. napus. An interesting feature that seems to
be specific to the Arabidopsis genome is the high occurrence of dimer
microsatellites in the 5' and 3'UTR regions (13.6% and 4.3%,
respectively). In the Poaceae, dimer microsatellites ranged from 15.4%
in barley to 27.3% in wheat (Table 3). Other studies indicated that the
highest dimer occurrence rates are generally associated with 5'UTR
regions (Morgante et al., 2002; Lawson and Zhang, 2006; Hong et al.,
2007), but one should bear in mind that the prevalence in CDS regions
found here may be a consequence of deficient database annotation.

Trimer microsatellites were found in 40.2% of the sequences, with a
high predominance in CDS regions. The species with the highest trimer
values were Arabidopsis, rice and tomato, with 58.0%, 54.7% and 41.4%
of occurrence, respectively. The average percentage of trimers within
each family was 47.0% in the Brassicaceae, 37.8% in the Solanaceae and
38.7% in the Poaceae. Among Poaceae species, the highest percentage of
trimer occurrence was found in rice (54.7%) and the lowest in maize
(34.6%). In the Brassicaceae, trimers were found more frequently in
Arabidopsis (58.0%) and less so in B. napus (36.1%) (Table 3).

On average, tetramers represented 8.2% of the microsatellites, with
average frequencies of 3.4%, 4.4% and 11.0% in the Brassicaceae,
Solanaceae and Poaceae, respectively. Among the Brassicaceae, only a
1.5-fold difference in frequencies was observed between Arabidopsis
(2.9%) and B. napus (4.4%). In the Poaceae, a 2.7-fold difference was
found between rice (6.1%) and barley (16.5%).

On an average, pentamers represented 10.36% of the 
microsatellites, with average frequencies of 4.5%, 6.6% 
and 13.6% in the Brassicaceae, Solanaceae and Poaceae, 
respectively (Table 3). Less than one-fold differences were 
found between Brassicaceae and Solanaceae species. Nev- 
ertheless, in the Poaceae a 1.7-fold difference was found 
between rice (9.7%) and maize (16.5%). 

On an average, hexamers represented 13.8% of the 
microsatellites, with average frequencies of 8.1%, 19.1% 
and 13% in the Brassicaceae, Solanaceae and Poaceae, re- 
spectively. In the Poaceae, a 2.4-fold difference was found 
between wheat (7.7%) and sorghum (18.3%). 
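
The fold differences quoted in this section are simple ratios of the
larger to the smaller percentage. A minimal Python check, using only
percentages stated above (the helper name `fold_difference` is
hypothetical):

```python
def fold_difference(a, b):
    """Ratio of the larger to the smaller of two percentages."""
    high, low = max(a, b), min(a, b)
    if low <= 0:
        raise ValueError("fold difference is undefined for non-positive values")
    return high / low


tetramer_poaceae = round(fold_difference(6.1, 16.5), 1)   # rice vs. barley
pentamer_poaceae = round(fold_difference(9.7, 16.5), 1)   # rice vs. maize
hexamer_poaceae = round(fold_difference(7.7, 18.3), 1)    # wheat vs. sorghum
```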

Mini-satellite frequencies were also assessed from the available data
(Table 3). On average, heptamers represented 4.5% of the total
occurrence (mini-satellites plus microsatellites). These repeats were
more common in the Solanaceae (9.6%). In the Brassicaceae and Poaceae,
the average frequencies of heptamers were 3.3% and 3.2%, respectively.
Octamers were more frequent in the Brassicaceae (0.8%) than in the
Solanaceae (0.3%) and Poaceae (0.1%). Nonamers were also more frequent
in the Brassicaceae (0.9%) than in the Solanaceae (0.6%) and Poaceae
(0.5%). Decamers were comparatively less frequent than other
mini-satellites, reaching frequencies of 0.2%, 0.1% and zero in the
Brassicaceae, Poaceae and Solanaceae, respectively (Table 3).

Several studies have reported on EST sequences containing
microsatellites. For the Poaceae (rice, maize, sorghum, barley and
wheat), frequencies ranging from 16.6 to 40% for dimers, 41 to 78% for
trimers, 2.6 to 14% for tetramers, 0.4 to 18.9% for pentamers and below
1% for hexamers have been reported (Varshney et al., 2002; Thiel et
al., 2003; La Rota et al., 2005; Parida et al., 2006). In the case of
Arabidopsis, frequencies of dimers (36.5%), trimers (62.1%), tetramers
(1.1%), pentamers (0.15%) and hexamers (0.13%) have been noted (Parida
et al., 2006).

Most frequent motifs 

Dimers and trimers 

Motif frequencies per species and average frequencies per family are
listed in Tables 4 and 5. For dimers, differences were observed within
and between families. In the Brassicaceae, AG/CT and GA/TC dimer motifs
were the most frequent, reaching 9.69% and 8.89% of observations within
the family. A 6.9-fold difference was observed for AG/CT between
Arabidopsis (2.46%) and B. napus (16.93%). Moreover, for the GA/TC
motif, an almost 10-fold difference was found between Arabidopsis
(1.64%) and B. napus (16.14%). Other reports have shown that AG/GA
motifs were the most frequent in Arabidopsis (Cardle et al., 2000;
Morgante et al., 2002; Lawson and Zhang, 2006; Parida et al., 2006) and
AT/TA in B. rapa (Hong et al., 2007). Among the Solanaceae, AT/AT and
TA/TA motifs were the most frequent, with frequencies of 8.29% and
5.69%, respectively. In Solanaceae ESTs, frequencies of 20%-25% and
15%-20% were found for AG and AT dimers, respectively (Kumpatla and
Mukhopadhyay, 2005). In the Poaceae, the most frequent motifs were
AG/CT and GA/TC, with average percentages of 6.72% and 5.61%,
respectively. In other studies, frequencies ranging from 38% to 50%
were the rule for the AG motif in maize, barley, rice, sorghum and
wheat (Kantety et al., 2002; Morgante et al., 2002; Varshney et al.,
2002; Thiel et al., 2003; Yu et al., 2004; La Rota et al., 2005), and
frequencies of 50% were reported for the AC motif in barley (Varshney
et al., 2002). GA has also been shown to be the most abundant motif in
grasses (Temnykh et al., 2001; Kantety et al., 2002; Nicot et al.,
2004; Parida et al., 2006). In all the species analyzed in the present
study, the lowest frequencies were found for motifs formed by guanine
and cytosine (CG/GC), which were absent altogether in Brassicaceae and
Solanaceae species.
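
The paired notation used here (e.g., AG/CT) groups a motif with its
reverse complement, since both describe the same locus read from
opposite strands. The Python sketch below shows this pairing; the
function names are hypothetical, and the sketch deliberately does not
merge cyclic shifts (AG vs. GA), which this study reports as separate
classes.

```python
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(motif):
    """Return the reverse complement of a DNA motif."""
    return motif.upper().translate(COMPLEMENT)[::-1]


def motif_class(motif):
    """Label a motif together with its reverse complement, e.g. 'AG/CT'."""
    motif = motif.upper()
    partner = reverse_complement(motif)
    first, second = sorted((motif, partner))
    return f"{first}/{second}"


labels = [motif_class(m) for m in ("AG", "CT", "GA", "TC", "AT", "GAA")]
```

Motifs such as AT are their own reverse complement, which is why they
appear as AT/AT in this notation.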

As was the case for dimers, trimer motif patterns differ within as well
as between families (Table 4). Among the Brassicaceae, GAA/TTC and
AAG/CTT motifs were the most abundant, reaching frequencies of 8.36%
and 6.73%, respectively. Contrasting values were observed for GAA/TTC
between Arabidopsis (12.13%) and B. napus (4.59%), as was also the case
for AAG/CTT between Arabidopsis (9.51%) and B. napus (3.96%). Some
reports have claimed that AAG is the most frequent motif in Arabidopsis
and B. rapa (Morgante et al., 2002; Hong et al., 2007). In the
Solanaceae, GAA/TTC and AGA/TCT were the most frequent, with values of
4.75% and 4.60%, respectively. For both, frequency values were higher
in S. tuberosum. Similar results were obtained in Arabidopsis, B.
napus, B. rapa, S. lycopersicum and S. tuberosum (Kumpatla and
Mukhopadhyay, 2005), as well as in Citrus (Jiang et al., 2006), where
AAG/AGA/GAA motifs were the most frequent. In the Poaceae, the trimers
CCG/CGG, CGC/GCG and GCC/GGC were the most frequent, corresponding to
5.89%, 5.85% and 5.06%, respectively, a total of 16.80% of all the
microsatellites found. Within the family, different motifs were the
most common: CCG/CGG was predominant in O. sativa, S. bicolor and H.
vulgare, GCC/GGC in T. aestivum and S. officinarum, and CGC/GCG in Z.
mays. Other studies have shown a predominance of CCG in the grass
species Z. mays, H. vulgare, O. sativa, S. bicolor, T. aestivum, S.
cereale and S. officinarum (Cordeiro et al., 2001; Kantety et al.,
2002; Morgante et al., 2002; Varshney et al., 2002; Thiel et al., 2003;
Nicot et al., 2004; Yu et al., 2004; La Rota et al., 2005; Peng and
Lapitan, 2005). These motifs (CCG/CGG, CGC/GCG and GCC/GGC) seem to be
less common in other families: instead of values of around 16.8% (found
for grasses), their frequency was 0.56% in the Brassicaceae and 0.36%
in the Solanaceae.

Tetramers, pentamers and hexamers 

For the loci formed by motifs longer than three nu- 
cleotides, only the ten highest average percentages for each 
family are shown (Tables 4 and 5). 

In the Brassicaceae, the tetramer motifs occurring at the highest
frequencies were AAGA/TCTT, AAAC/GTTT and GAAA/TTTC, together adding up
to 1.04% of all motifs found. Other reports indicate that AAAG/AAAT
motifs were predominant in Arabidopsis and AAAT in B. rapa (Cardle et
al., 2000; Hong et al., 2007). For 5'UTR/CDS and 3'UTR Arabidopsis
regions, the predominant motifs reported were AAAG/CTTT and AAAC/GTTT,
respectively (Morgante et al., 2002; Zhang et al., 2004). For
Solanaceae species, 1.96% of all motifs found were TAAA/TTTA, TTAA/TTAA
or AAGA/TCTT. These results agree with EST data from 20 dicot species
(Kumpatla and Mukhopadhyay, 2005). Among the grasses, 0.85% of all
motifs were CCTC/GAGG, AGGA/TCCT or CATC/GATG. Differences in
predominant tetramer rates were found among the species (Table 4).
Other reports have shown ACGT as the most abundant in barley (Varshney
et al., 2002; Thiel et al., 2003), AAAG/CTTT and AAGG/CCTT in perennial
ryegrass (Asp et al., 2007) and AAAG as the most frequent motif in rice
BACs (McCouch et al., 2002).

For pentamers, the predominant motifs were GAAAA/TTTTC, AAAAT/ATTTT and
AAAAC/GTTTT in the Brassicaceae (0.80%), AAAAT/ATTTT, AAAAG/CTTTT and
AGAAG/CTTCT in the Solanaceae (1.37%) and CTCTC/GAGAG, GAGGA/TCCTC and
CTTCC/GGAAG in the Poaceae (0.83%). The major difference among plant
families is the predominance of A/T in the Brassicaceae and Solanaceae.
Reports on CDS regions in Arabidopsis, S. cerevisiae and C. elegans
indicated the predominance of ACCCG and AAAAG (Toth et al., 2000). For
eukaryotes in general, AAAAT, AAAAC and AAAAG are the most predominant
(Li et al., 2004a). On the other hand, 5'UTR and 3'UTR regions of
Arabidopsis were shown to be rich in AAGAG and AAAAC, respectively
(Zhang et al., 2004). AAAAT (Hong et al., 2007) and AAAAT/AAAAG (Jiang
et al., 2006) were described as being frequently found in the Rosaceae
and Citrus, respectively. In transcripts from the TIGR database, the
AGAGG motif was predominant in rice, AGGGG in barley and ACGAT in wheat
(La Rota et al., 2005). Very little information was found on the
preferential occurrence of pentamers in grasses, whereas that on
eukaryotes (Toth et al., 2000; Li et al., 2004a), Citrus (Palmieri et
al., 2007; Jiang et al., 2006), Arabidopsis (Zhang et al., 2004) and
Rosaceae (Hong et al., 2007) offered variable results.

Different hexamer patterns occurred among and within the three analyzed
plant families (Table 5). To date, the predominance of AAGGAG hexamers
in Arabidopsis has been confirmed by only one other study (Toth et al.,
2000). Other reports indicated the most frequently encountered hexamers
to be AAGATG, AAAGAG and AAAAAT in Arabidopsis (Zhang et al., 2004),
AAAAAG in Citrus (Jiang et al., 2006), AACACG in S. cerevisiae, ACCAGG
in C. elegans, AAGGCC in mammals and CCCCGG in primates (Toth et al.,
2000). The ten major occurrences for heptamers, octamers, nonamers and
decamers are presented in Table 5. Occurrences are widely variable
within and among families, making it difficult to establish a pattern
or to base a discussion on similarities.

Genome dynamics is very complex with regard to microsatellite motifs in
plants. The higher conservation of dimer motifs (AG/CT and GA/TC) seems
to overcome evolutionary distances such as those found between monocot
and dicot plants. However, within the dicots, this conservation may not
hold. Unexpectedly, the Poaceae and Brassicaceae were closer when these
motifs were analyzed. On the other hand, trimer microsatellites, which
are known to be predominant in coding regions, followed the expected
conservation pattern, with similar rates and predominant motifs
(GAA/TTC) between the two dicot families. Trimers present at higher
frequencies in the grasses tend to be formed by G/C arrangements, in
contrast to dicot plants, where G/A/T/C combinations are more frequent.
The higher frequency of A/T-rich repeats is also found in pentamer
motifs in the dicot families. Repeats of higher complexity did not
reveal detectable conserved patterns in this study.



Conclusions

The occurrence of micro- and minisatellites in rice sequences (11.28%)
is higher than in other species, with 2.5 to 5 times more sequences
containing these repetitive DNA loci. The fact that species having
larger genomes (T. aestivum, H. vulgare and S. officinarum) do not
present a correspondingly higher frequency of repetitive loci implies
that there is no relationship between genome size and rates of tandem
repeat occurrence in functional regions. However, the lower coverage of
sequences present in databases for these species could also be a reason
for the low rates found in some species. For Arabidopsis and rice, the
results obtained are closer to reality, since both are considered model
species and have been intensely studied.

The distribution of micro- and minisatellites was 
higher in CDS regions for all the studied species. Also, 
microsatellites (97%) were more common than mini- 
satellites (3%). Per family, the predominant dimer motifs 
were the same for Brassicaceae and Poaceae (AG/CT) and 
different for the Solanaceae (AT/AT). Trimers were the 
predominant repeats, ranging between 34.3% and 58.0%, 
with different rates depending on the family or species. For 
the Solanaceae, the predominant trimer motifs were not the 
same for S. lycopersicum (ATA/TAT and AAT/TTA) and 
S. tuberosum (GAA/TTC and AGA/TCT). This could be 
due to selection. Among the grasses, trimers formed by C/G 
were the most abundant. Nevertheless, specific motifs were 
variable between species. 

Disagreements between earlier reports and the results obtained in the
present work, where dimers were also frequent in CDS regions, could be
due to the fact that the Unigene database contains predominantly EST
clusters. Therefore, UTR regions tend to be under-represented in the
annotated sequences. This is true for all species except Arabidopsis.
The problem could be solved by manually curating the genes, thereby
defining the different regions. This, however, would require a
community effort.

The results obtained shed light on the patterns of tandem repeat
occurrence within and between different plant families, thereby
facilitating plant-breeding strategies based on the transfer of markers
from model to orphan species.